Machine Learning Techniques (Hsuan-Tien Lin) Lecture Notes 12 -- Neural Network


Author (id: redstonewill)


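The previous lecture introduced the Gradient Boosted Decision Tree (GBDT). GBDT uses the functional gradient approach to grow a sequence of different trees, then uses steepest descent to give each tree its own weight, and the resulting method can handle any error measure. The GBDT presented there was developed for regression with the squared error measure. This lecture introduces a model that appeared early in the history of machine learning and is very popular again today: the Neural Network.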


Motivation 
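In the earlier Machine Learning Foundations course we already met the Perceptron model, for example through the PLA algorithm. A perceptron computes the score $w^T x$ and passes it through a sign function, so its output lies in $\{-1, +1\}$. Now suppose we linearly combine many perceptrons. The resulting model $G$ is shown below: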










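[Figure: a linear aggregation of perceptrons. The inputs $x_0 = 1, x_1, \dots, x_d$ feed $T$ perceptrons $g_1, \dots, g_T$ through the weights $w_t$, and their outputs are combined through the weights $\alpha_t$ into $G$.]

$$G(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t \, \mathrm{sign}\left(w_t^T x\right)\right), \qquad g_t(x) = \mathrm{sign}\left(w_t^T x\right)$$

• two layers of weights: $w_t$ and $\alpha$
• two layers of sign functions: in $g_t$ and in $G$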











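The inputs on the left, $(x_0, x_1, x_2, \dots, x_d)$, are multiplied by $T$ different weight vectors $(w_1, w_2, \dots, w_T)$, each of dimension $d+1$, which gives $T$ different perceptrons $(g_1, g_2, \dots, g_T)$. Each $g_t$ then receives its own weight from $(\alpha_1, \alpha_2, \dots, \alpha_T)$, and the weighted sum passed through sign gives $G$. $G$ is itself a perceptron, built on top of the $g_t$.

Structurally, this model contains two layers of weights, $w_t$ and $\alpha$, and two layers of sign functions, one inside each $g_t$ and one in $G$. What kinds of boundary can such a linear aggregation of perceptrons implement?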


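Take a simple example, shown below. $g_1$ and $g_2$ are two perceptrons in the plane, where red marks $-1$ and blue marks $+1$. A linear combination of these two perceptrons can produce the model on the right of the figure, which is the logical AND of $g_1$ and $g_2$. The blue region is $+1$.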


[Figure: three panels showing the regions of $g_1$, $g_2$, and $\mathrm{AND}(g_1, g_2)$ in the plane.]


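How can a perceptron model implement the logical operation $\mathrm{AND}(g_1, g_2)$? One way is to set the second-layer weights to $\alpha_0 = -1$, $\alpha_1 = +1$, $\alpha_2 = +1$. Then $G(x)$ becomes:

$$G(x) = \mathrm{sign}\left(-1 + g_1(x) + g_2(x)\right)$$

$g_1$ and $g_2$ take values in $\{-1, +1\}$. When $g_1 = -1, g_2 = -1$, $G(x) = \mathrm{sign}(-3) = -1$. When $g_1 = -1, g_2 = +1$, $G(x) = \mathrm{sign}(-1) = -1$. When $g_1 = +1, g_2 = -1$, $G(x) = \mathrm{sign}(-1) = -1$. Only when $g_1 = +1, g_2 = +1$ do we get $G(x) = \mathrm{sign}(+1) = +1$. The corresponding perceptron model is shown below: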


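[Figure: the network for $G \equiv \mathrm{AND}(g_1, g_2)$, with inputs $x_0 = 1, x_1, \dots, x_d$, hidden units $g_1$ and $g_2$, and second-layer weights $-1, +1, +1$.]

$$G(x) = \mathrm{sign}\left(-1 + g_1(x) + g_2(x)\right)$$

• $g_1(x) = g_2(x) = +1$ (TRUE): $G(x) = +1$ (TRUE)
• otherwise: $G(x) = -1$ (FALSE)
• $G \equiv \mathrm{AND}(g_1, g_2)$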
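This example shows that simple linear boundaries such as $g_1$ and $g_2$, after one more layer of perceptron and a linear combination, can produce a nonlinear and more complex boundary $G(x)$ (here the AND operation). In the same way, OR and NOT can each be implemented by a very simple perceptron model.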
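The second-layer weights discussed above can be checked directly. The sketch below is a minimal illustration: the AND weights $(-1, +1, +1)$ come from the text, while the OR weights $(+1, +1, +1)$, the NOT weight $-1$, and the convention sign(0) = +1 are assumptions chosen for this example.

```python
def sign(s):
    # Assumed convention: sign(0) = +1. The value 0 never occurs for the inputs below.
    return 1 if s >= 0 else -1

def AND(g1, g2):
    # alpha_0 = -1, alpha_1 = +1, alpha_2 = +1 (from the text)
    return sign(-1 + g1 + g2)

def OR(g1, g2):
    # alpha_0 = +1, alpha_1 = +1, alpha_2 = +1 (assumed for this sketch)
    return sign(+1 + g1 + g2)

def NOT(g1):
    # a single weight of -1 and no bias (assumed for this sketch)
    return sign(-g1)

# Print the full truth table over {-1, +1}.
for g1 in (-1, +1):
    for g2 in (-1, +1):
        print(g1, g2, "AND:", AND(g1, g2), "OR:", OR(g1, g2))
```

Only the bias $\alpha_0$ differs between AND and OR: it decides how many of the inputs must be $+1$ before the sum turns positive.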


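So a linear aggregation of perceptrons is a very powerful model, and also a very complicated one. Consider another example: a circular region in the plane, $+1$ inside the circle and $-1$ outside. A single perceptron cannot produce such a boundary. With 8 perceptrons combined as above, we get a boundary close to the circle (an octagon). With 16 perceptrons the boundary is closer still (a 16-sided polygon). In general, the more perceptrons we use, the more closely we can approximate any convex set with a convex polygon. As discussed in Machine Learning Foundations, the hypothesis set of convex sets has infinite VC dimension (its growth function is $2^N$). With enough perceptrons we can therefore realize essentially any labeling. The downside is that the model complexity can become very large, which leads to overfitting.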





[Figure: three panels, "8 perceptrons", "16 perceptrons", and "target boundary" (a circle).]

• 'convex set' hypotheses implemented: $d_{VC} \to \infty$, remember? :-)
• powerfulness: enough perceptrons $\approx$ smooth boundary


In short, a linear combination of enough perceptrons yields a smoother boundary and a more stable model, which is one of the characteristic benefits of aggregation.
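The polygon approximation of a circle can be reproduced with the AND construction. This is a sketch under assumptions not stated in the text: the $T$ perceptron boundaries are placed tangent to a circle of radius 1 at evenly spaced angles, sign(0) is taken as $+1$, and the quality of the approximation is measured on a hypothetical $61 \times 61$ grid.

```python
import math

def sign(s):
    return 1 if s >= 0 else -1

def polygon_classifier(T, radius=1.0):
    """AND of T perceptrons whose boundaries are tangent to a circle of the given radius."""
    angles = [2 * math.pi * t / T for t in range(T)]

    def G(x1, x2):
        # g_t(x) = sign(radius - u_t . x) is +1 on the inner side of the t-th line
        votes = [sign(radius - (math.cos(a) * x1 + math.sin(a) * x2)) for a in angles]
        # AND of T inputs: the sum beats the bias -(T - 1) only when every vote is +1
        return sign(-(T - 1) + sum(votes))

    return G

def mismatches(T, radius=1.0, steps=61, half_width=1.5):
    """Count grid points where the polygon classifier and the circle disagree."""
    G = polygon_classifier(T, radius)
    count = 0
    for i in range(steps):
        for j in range(steps):
            x1 = -half_width + 2 * half_width * i / (steps - 1)
            x2 = -half_width + 2 * half_width * j / (steps - 1)
            target = 1 if x1 * x1 + x2 * x2 <= radius * radius else -1
            if G(x1, x2) != target:
                count += 1
    return count

for T in (4, 8, 16, 32):
    print(T, "perceptrons:", mismatches(T), "mismatched grid points")
```

The mismatch count shrinks as $T$ grows, because the corners of the polygon that stick out beyond the circle get smaller.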

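However, there are things a single layer of perceptrons combined linearly cannot do. AND, OR, and NOT can each be implemented this way, but exclusive or (XOR) cannot. The reason is that, in terms of the transformed features $(g_1(x), g_2(x))$, the XOR labels are not linearly separable, as the figure below shows, so no linear combination of $g_1$ and $g_2$ produces them. The capacity of a linear aggregation of perceptrons is therefore still limited.

[Figure: three panels showing the regions of $g_1$, $g_2$, and $\mathrm{XOR}(g_1, g_2)$.]

• limitation: XOR not 'linear separable' under $\phi(x) = (g_1(x), g_2(x))$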


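To implement XOR we can use more than one layer of perceptrons: if one transform is not enough, apply several in sequence. This is the prototype of the basic neural network. Let us build XOR from two layers of perceptrons. By Boolean algebra, XOR decomposes as:

$$\mathrm{XOR}(g_1, g_2) = \mathrm{OR}\left(\mathrm{AND}(-g_1, g_2), \mathrm{AND}(g_1, -g_2)\right)$$

This decomposition contains two layers of transforms: the first layer uses only AND operations and the second layer is an OR. The resulting two-layer perceptron model is shown below: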
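[Figure: a two-layer perceptron network computing XOR, with inputs $x_0 = 1, x_1, \dots, x_d$, the perceptrons $g_1$ and $g_2$, a layer of two AND units, and an OR unit at the output.]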
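The two-layer decomposition can be verified on all four input combinations. The sketch below is a minimal check: the XOR decomposition and the AND weights are taken from the text, while the OR weights $(+1, +1, +1)$ and the convention sign(0) = +1 are assumptions for this example.

```python
def sign(s):
    return 1 if s >= 0 else -1

def AND(a, b):
    return sign(-1 + a + b)

def OR(a, b):
    # OR weights (+1, +1, +1) are an assumption of this sketch
    return sign(+1 + a + b)

def XOR(g1, g2):
    # First layer: two AND units fed with one negated input each.
    h1 = AND(-g1, g2)
    h2 = AND(g1, -g2)
    # Second layer: a single OR unit.
    return OR(h1, h2)

for g1 in (-1, +1):
    for g2 in (-1, +1):
        print(g1, g2, "XOR:", XOR(g1, g2))
```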





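Going from AND to XOR, and from a simple aggregation of perceptrons to multi-layer perceptrons, the number of layers grows and so does the capacity of the model, so the final $G$ can handle nonlinear, more complex problems more easily. This is the basic model behind neural networks.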


perceptron (simple) 
=> aggregation of perceptrons (powerful) 
=> multi-layer perceptrons (more powerful)


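Incidentally, the perceptron model described here is loosely modeled on biological neurons, which is where the name Neural Network comes from. The input of each node corresponds to a neuron's dendrite, and the output of each node corresponds to its axon.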


Neural Network Hypothesis 
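The multi-layer perceptron model from the previous section is in fact a Neural Network. The input passes through the layers one at a time, each acting as a transform, and the weights of the last layer then produce a score $s$. In other words, the OUTPUT layer is simply a linear model applied to the transformed features. Once we have $s$, we can process it further as the problem requires.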








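[Figure: a neural network with inputs $x_0 = 1, x_1, \dots, x_d$, hidden layers, and a single OUTPUT node.]

• OUTPUT: simply a linear model with $s = w^T \phi^{(2)}\left(\phi^{(1)}(x)\right)$
• any linear model can be used, remember? :-)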


We have already studied three linear models: linear classification, linear regression, and logistic regression. For the score $s$ at the OUTPUT layer, we pick whichever linear model suits the problem. For binary classification, use linear classification. For regression, use linear regression. For soft classification, use logistic regression. The rest of this lecture uses linear regression as the running example, with the squared error as the error measure.





                linear classification    linear regression    logistic regression
hypothesis      h(x) = sign(s)           h(x) = s             h(x) = θ(s)
error measure   err = 0/1                err = squared        err = cross-entropy





That covers the OUTPUT layer. In the hidden layers, each node is a perceptron and applies its own transform. The transformation function used so far is the step function sign(). Are there other choices besides sign()?


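If every node used a linear transformation function (as the OUTPUT node does), the whole network, being a composition of linear models, would also be linear. It would then be no more expressive than a single linear model, yet cost more effort to compute. For this reason the hidden nodes generally do not use a linear transformation.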

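If every node uses the step function sign(), the model is nonlinear, but the step function is discrete: it is not differentiable at 0 and its derivative is zero everywhere else, so it is hard to optimize with gradient-based methods. The step function is therefore usually not chosen as the transformation function either.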


Since neither the linear function nor the step function is a good transformation function, a popular choice is $\tanh(s)$, defined as:

$$\tanh(s) = \frac{\exp(s) - \exp(-s)}{\exp(s) + \exp(-s)}$$

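$\tanh(s)$ is a smooth, S-shaped function. When $|s|$ is large, $\tanh(s)$ is close to the step function. When $|s|$ is small, $\tanh(s)$ is close to the linear function. Mathematically, it is continuous and differentiable everywhere, which makes optimization convenient. Its shape also resembles the step function, so it keeps the nonlinearity needed to build complex, powerful models.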


Incidentally, $\tanh(s)$ is related to the sigmoid function $\theta$ as follows:

$$\tanh(s) = 2\theta(2s) - 1$$

where

$$\theta(s) = \frac{1}{1 + \exp(-s)}$$
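The identity $\tanh(s) = 2\theta(2s) - 1$ is easy to confirm numerically. The sketch below compares both sides with Python's standard `math` module; the function names `theta` and `tanh_via_sigmoid` and the sample points are chosen for this example only.

```python
import math

def theta(s):
    """Logistic sigmoid, theta(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + math.exp(-s))

def tanh_via_sigmoid(s):
    """tanh expressed through the sigmoid: 2 * theta(2s) - 1."""
    return 2.0 * theta(2.0 * s) - 1.0

for s in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(s, math.tanh(s), tanh_via_sigmoid(s))
```

The two columns agree up to floating-point rounding, so tanh is a sigmoid rescaled from the range $(0, 1)$ to $(-1, 1)$.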


• transformation function of the score (signal) $s$: any transformation?
    • linear: whole network linear and thus less useful
    • step (sign): discrete and thus hard to optimize for $w$
• popular choice of transformation: $\tanh(s)$
    • 'analog' approximation of sign: easier to optimize
    • somewhat closer to biological neuron
    • not that new! :-)

$$\tanh(s) = \frac{\exp(s) - \exp(-s)}{\exp(s) + \exp(-s)} = 2\theta(2s) - 1$$





From here on we use tanh as the transformation function of the hidden layers, and all derivations are based on it. In practice other transformation functions can be chosen, and each choice leads to a slightly different derivation.


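Let us now look closely at the structure of the Neural Network Hypothesis. In the network shown below, the leftmost layer is the input layer, the two middle layers are hidden layers, and the rightmost layer is the output layer. We number the input layer as layer 0, followed by layer 1 and layer 2, and the output layer is layer 3.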





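In the Neural Network Hypothesis, $d^{(0)}, d^{(1)}, \dots, d^{(L)}$ denote the number of nodes in each layer, where $L$ is the number of layers after the input layer. The network above has $L = 3$. Now consider the weights $w_{ij}^{(\ell)}$. The superscript satisfies $1 \le \ell \le L$ and names the layer. The subscript $i$ satisfies $0 \le i \le d^{(\ell-1)}$ and indexes the outputs of the previous layer plus the bias term ($i = 0$, the constant input). The subscript $j$ satisfies $1 \le j \le d^{(\ell)}$ and indexes the nodes of the current layer (the bias node excluded).

The score of each node is:

$$s_j^{(\ell)} = \sum_{i=0}^{d^{(\ell-1)}} w_{ij}^{(\ell)} x_i^{(\ell-1)}$$

The transformed output of each node is:

$$x_j^{(\ell)} = \begin{cases} \tanh\left(s_j^{(\ell)}\right), & \text{if } \ell < L \\ s_j^{(\ell)}, & \text{if } \ell = L \end{cases}$$

Because this is a regression model, the output layer ($\ell = L$) gives $x_j^{(L)} = s_j^{(L)}$ directly.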





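[Slide: Neural Network (NNet) hypothesis with layer sizes $d^{(0)}$-$d^{(1)}$-$d^{(2)}$-$\cdots$-$d^{(L)}$. Weights $w_{ij}^{(\ell)}$ for $1 \le \ell \le L$ layers, $0 \le i \le d^{(\ell-1)}$ inputs, and $1 \le j \le d^{(\ell)}$ outputs. Score $s_j^{(\ell)} = \sum_{i=0}^{d^{(\ell-1)}} w_{ij}^{(\ell)} x_i^{(\ell-1)}$. Transformed $x_j^{(\ell)} = \tanh\left(s_j^{(\ell)}\right)$ if $\ell < L$, and $x_j^{(\ell)} = s_j^{(\ell)}$ if $\ell = L$.]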
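The score and transformation formulas translate into a short forward pass. The sketch below uses plain Python lists. The storage layout (`weights[l-1][i][j]` standing for $w_{ij}^{(\ell)}$, with row $i = 0$ holding the bias weights), the function name `forward`, and the example 2-3-1 network with hand-picked weights are all assumptions made for illustration.

```python
import math

def forward(weights, x):
    """Return the layer outputs [x^(0), ..., x^(L)], each without its bias entry."""
    outputs = [list(x)]
    L = len(weights)
    for l, W in enumerate(weights, start=1):
        prev = [1.0] + outputs[-1]                 # x_0^(l-1) = 1 is the bias input
        d_out = len(W[0])
        # s_j^(l) = sum_i w_ij^(l) * x_i^(l-1)
        scores = [sum(W[i][j] * prev[i] for i in range(len(prev))) for j in range(d_out)]
        # tanh on hidden layers, identity on the output layer (regression)
        outputs.append(scores if l == L else [math.tanh(s) for s in scores])
    return outputs

# A hypothetical 2-3-1 network with hand-picked weights.
W1 = [[0.1, -0.2, 0.0],    # bias row (i = 0)
      [0.5, 0.3, -0.4],    # weights leaving x_1
      [-0.6, 0.2, 0.7]]    # weights leaving x_2
W2 = [[0.05], [1.0], [-1.0], [0.5]]

layers = forward([W1, W2], [1.0, 2.0])
print("hidden outputs:", layers[1])
print("network output:", layers[2][0])
```

Keeping every layer's output, rather than only the last one, is deliberate: gradient-based training needs the intermediate $x_i^{(\ell-1)}$ values again.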


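With the structure of the Neural Network Hypothesis in place, what does it mean in practice? Looking at the network diagram again, each layer's mapping from input to output is a transformation, and the transformation is determined by the weights $w_{ij}^{(\ell)}$. Each node takes the inner product of its input vector $x$ with its weight vector $w$, passes the result through tanh, and hands it to the next layer, from left to right. The larger the inner product $w^T x$, the closer $\tanh(w^T x)$ is to $+1$. For vectors of fixed length, a large inner product means $x$ and $w$ are close to parallel, that is, $x$ matches the pattern encoded by $w$. Each hidden node can therefore be read as a detector that reports how well its input matches its weight vector. In this sense the core of neural network training is pattern extraction: layer by layer, it finds weight vectors that capture the regularities in the data, and the final layer combines the detected patterns to produce $G$.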


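• each layer: transformation to be learned from data

$$\phi^{(\ell)}(x) = \tanh\left(\begin{bmatrix} \sum_{i=0}^{d^{(\ell-1)}} w_{i1}^{(\ell)} x_i^{(\ell-1)} \\ \vdots \end{bmatrix}\right)$$

  whether $x$ 'matches' weight vectors in pattern
• NNet: pattern extraction with layers of connection weights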
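The "matching" reading of a hidden node can be illustrated with a single tanh neuron. In this sketch the weight vector, the three inputs of equal length, and the omission of the bias term are all assumptions chosen to make the comparison clean.

```python
import math

def neuron(w, x):
    """Output of one tanh node: tanh(w . x). The bias is omitted in this sketch."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

w = [1.0, 2.0]               # hypothetical weight vector (the "pattern")
x_parallel = [0.5, 1.0]      # same direction as w
x_orthogonal = [1.0, -0.5]   # same length, perpendicular to w
x_opposite = [-0.5, -1.0]    # same length, opposite direction

for name, x in (("parallel", x_parallel),
                ("orthogonal", x_orthogonal),
                ("opposite", x_opposite)):
    print(name, neuron(w, x))
```

Inputs aligned with $w$ push the output toward $+1$, perpendicular inputs give 0, and opposite inputs push it toward $-1$.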


Neural Network Learning 





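We have covered the structure of the Neural Network Hypothesis and how it computes its output. Fixing a network amounts to fixing the weights $w_{ij}^{(\ell)}$ of every layer. Given training data, how do we find the weights $w_{ij}^{(\ell)}$ that minimize the error? The derivation follows.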





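Our goal is to find the $w_{ij}^{(\ell)}$ that minimize $E_{in}\left(\{w_{ij}^{(\ell)}\}\right)$. With only one hidden layer, the network is just an aggregation of perceptrons, and we could use the gradient boosting approach from the previous lecture to determine the hidden neurons one at a time, with each neuron's weights supplied by a base learner (C&RT played that role in the previous lecture). This is not how neural networks are usually trained, though. With two or more hidden layers, the aggregation-of-perceptrons approach no longer works and we need a different method.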


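Starting from the error function at the output, the squared error between the network's prediction and the true value for a single example is $e_n = (y_n - \mathrm{NNet}(x_n))^2$. If we can express $e_n$ as a function of every weight $w_{ij}^{(\ell)}$, we can take partial derivatives with respect to $w_{ij}^{(\ell)}$ and apply GD or SGD, iteratively updating the weights until we reach values of $w_{ij}^{(\ell)}$ that make the error small.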


• goal: learning all $\{w_{ij}^{(\ell)}\}$ to minimize $E_{in}\left(\{w_{ij}^{(\ell)}\}\right)$
• one hidden layer: simply aggregation of perceptrons (gradient boosting to determine hidden neuron one by one)
• multiple hidden layers? not easy
• let $e_n = (y_n - \mathrm{NNet}(x_n))^2$: can apply (stochastic) GD after computing $\frac{\partial e_n}{\partial w_{ij}^{(\ell)}}$


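To relate $e_n$ to the weights of each layer and obtain $\partial e_n / \partial w_{ij}^{(\ell)}$, start with the output layer and compute $\partial e_n / \partial w_{i1}^{(L)}$. As a function of $w_{i1}^{(L)}$, the error is:

$$e_n = (y_n - \mathrm{NNet}(x_n))^2 = \left(y_n - s_1^{(L)}\right)^2 = \left(y_n - \sum_{i=0}^{d^{(L-1)}} w_{i1}^{(L)} x_i^{(L-1)}\right)^2$$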


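Differentiating $e_n$ with respect to $w_{i1}^{(L)}$ by the chain rule gives:

specially (output layer), for $0 \le i \le d^{(L-1)}$:

$$\frac{\partial e_n}{\partial w_{i1}^{(L)}} = \frac{\partial e_n}{\partial s_1^{(L)}} \cdot \frac{\partial s_1^{(L)}}{\partial w_{i1}^{(L)}} = -2\left(y_n - s_1^{(L)}\right) \cdot x_i^{(L-1)}$$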


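That is the result for the output layer. For the other layers, $\ell \ne L$, the partial derivative can be written in the same form:

generally ($1 \le \ell < L$), for $0 \le i \le d^{(\ell-1)}$ and $1 \le j \le d^{(\ell)}$:

$$\frac{\partial e_n}{\partial w_{ij}^{(\ell)}} = \frac{\partial e_n}{\partial s_j^{(\ell)}} \cdot \frac{\partial s_j^{(\ell)}}{\partial w_{ij}^{(\ell)}} = \delta_j^{(\ell)} \cdot x_i^{(\ell-1)}$$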


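In this derivation, $\delta_j^{(\ell)}$ denotes the partial derivative of $e_n$ with respect to the score $s_j^{(\ell)}$ of the $j$-th neuron in layer $\ell$:

$$\frac{\partial e_n}{\partial s_j^{(\ell)}} = \delta_j^{(\ell)}$$

When $\ell = L$, $\delta_1^{(L)} = -2\left(y_n - s_1^{(L)}\right)$. When $\ell \ne L$, $\delta_j^{(\ell)}$ is not yet known, so we next look for a recursive relation between the $\delta$ values of adjacent layers.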


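$$s_j^{(\ell)} \overset{\tanh}{\Longrightarrow} x_j^{(\ell)} \overset{w_{jk}^{(\ell+1)}}{\Longrightarrow} \begin{bmatrix} s_1^{(\ell+1)} \\ \vdots \\ s_k^{(\ell+1)} \\ \vdots \end{bmatrix} \Longrightarrow \cdots \Longrightarrow e_n$$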


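As the chain above shows, the score $s_j^{(\ell)}$ of the $j$-th neuron in layer $\ell$ passes through tanh to give the output $x_j^{(\ell)}$, which is multiplied by the next layer's weights $w_{jk}^{(\ell+1)}$ to contribute to the scores $s_k^{(\ell+1)}$ of layer $\ell + 1$, and so on until the final error $e_n$.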


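Using this dependence of $s_k^{(\ell+1)}$ on $s_j^{(\ell)}$, we can expand $\delta_j^{(\ell)}$ through the intermediate variables by the chain rule:

$$\delta_j^{(\ell)} = \frac{\partial e_n}{\partial s_j^{(\ell)}} = \sum_{k=1}^{d^{(\ell+1)}} \frac{\partial e_n}{\partial s_k^{(\ell+1)}} \cdot \frac{\partial s_k^{(\ell+1)}}{\partial x_j^{(\ell)}} \cdot \frac{\partial x_j^{(\ell)}}{\partial s_j^{(\ell)}} = \sum_{k=1}^{d^{(\ell+1)}} \left(\delta_k^{(\ell+1)}\right)\left(w_{jk}^{(\ell+1)}\right)\left(\tanh'\left(s_j^{(\ell)}\right)\right)$$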





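Note the summation in this expression: $k$ ranges over the neurons of the next layer, layer $\ell + 1$. The sum appears because $s_j^{(\ell)}$ takes part in computing every $s_k^{(\ell+1)}$, so $e_n$ depends on $s_j^{(\ell)}$ through all of them.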


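We now have a recursion between $\delta_j^{(\ell)}$ and $\delta_k^{(\ell+1)}$: once the values $\delta_k^{(\ell+1)}$ are known, $\delta_j^{(\ell)}$ follows. Since the output layer gives $\delta_1^{(L)} = -2\left(y_n - s_1^{(L)}\right)$, we can work backward one layer at a time to obtain $\delta_j^{(\ell)}$ for every layer, and from these the partial derivatives $\partial e_n / \partial w_{ij}^{(\ell)} = \delta_j^{(\ell)} x_i^{(\ell-1)}$ for every weight. With the partial derivatives in hand, GD or SGD can update the weights iteratively until they converge.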
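The backward recursion can be checked against numerical differentiation. The sketch below is one possible implementation under stated assumptions: `weights[l-1][i][j]` stands for $w_{ij}^{(\ell)}$ with row 0 as the bias, the network has a single output, $\tanh'(s) = 1 - \tanh^2(s)$ is used for the derivative, and the 2-3-2-1 network, the input, and the target are arbitrary values chosen for the check.

```python
import math
import random

def forward(weights, x):
    """Layer outputs [x^(0), ..., x^(L)]; tanh on hidden layers, identity at the output."""
    outputs = [list(x)]
    for l, W in enumerate(weights, start=1):
        prev = [1.0] + outputs[-1]                      # bias input x_0 = 1
        scores = [sum(W[i][j] * prev[i] for i in range(len(prev)))
                  for j in range(len(W[0]))]
        outputs.append(scores if l == len(weights) else [math.tanh(s) for s in scores])
    return outputs

def backprop(weights, x, y):
    """Partial derivatives of e_n = (y - NNet(x))^2, in the same layout as weights."""
    outputs = forward(weights, x)
    L = len(weights)
    delta = [-2.0 * (y - outputs[L][0])]                # delta_1^(L)
    grads = [None] * L
    for l in range(L, 0, -1):
        prev = [1.0] + outputs[l - 1]
        # d e_n / d w_ij^(l) = delta_j^(l) * x_i^(l-1)
        grads[l - 1] = [[delta[j] * prev[i] for j in range(len(delta))]
                        for i in range(len(prev))]
        if l > 1:
            W = weights[l - 1]
            # delta_j^(l-1) = sum_k delta_k^(l) * w_jk^(l) * (1 - tanh^2(s_j^(l-1)))
            delta = [sum(delta[k] * W[j + 1][k] for k in range(len(delta)))
                     * (1.0 - outputs[l - 1][j] ** 2)
                     for j in range(len(outputs[l - 1]))]
    return grads

def squared_error(weights, x, y):
    return (y - forward(weights, x)[-1][0]) ** 2

def max_gradient_gap(weights, x, y, eps=1e-6):
    """Largest difference between backprop and central finite differences."""
    grads = backprop(weights, x, y)
    gap = 0.0
    for l, W in enumerate(weights):
        for i in range(len(W)):
            for j in range(len(W[0])):
                old = W[i][j]
                W[i][j] = old + eps
                e_plus = squared_error(weights, x, y)
                W[i][j] = old - eps
                e_minus = squared_error(weights, x, y)
                W[i][j] = old                           # restore the weight
                numeric = (e_plus - e_minus) / (2 * eps)
                gap = max(gap, abs(numeric - grads[l][i][j]))
    return gap

rng = random.Random(0)
sizes = [2, 3, 2, 1]                                    # hypothetical layer sizes
weights = [[[rng.uniform(-0.5, 0.5) for _ in range(sizes[l + 1])]
            for _ in range(sizes[l] + 1)] for l in range(len(sizes) - 1)]
gap = max_gradient_gap(weights, [0.3, -0.7], 0.5)
print("largest gap between backprop and finite differences:", gap)
```

A gap near floating-point noise indicates that the recursion for $\delta_j^{(\ell)}$ and the product $\delta_j^{(\ell)} x_i^{(\ell-1)}$ were implemented consistently with the derivation.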







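In neural networks, this way of computing derivatives from the output back toward the input is called the Backpropagation Algorithm, the "BP" in the often-mentioned BP neural network. The algorithm proceeds as follows: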


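Backprop on NNet

initialize all weights $w_{ij}^{(\ell)}$
for $t = 0, 1, \dots, T$
  1. stochastic: randomly pick $n \in \{1, 2, \dots, N\}$
  2. forward: compute all $x_i^{(\ell)}$ with $x^{(0)} = x_n$
  3. backward: compute all $\delta_j^{(\ell)}$ subject to $x^{(0)} = x_n$
  4. gradient descent: $w_{ij}^{(\ell)} \leftarrow w_{ij}^{(\ell)} - \eta \, x_i^{(\ell-1)} \delta_j^{(\ell)}$
return $g_{\mathrm{NNET}}(x) = \left(\cdots \tanh\left(\sum_j w_{jk}^{(2)} \cdot \tanh\left(\sum_i w_{ij}^{(1)} x_i\right)\right)\right)$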


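The procedure above is SGD: each update uses a single example, which tends to make the updates noisy. In practice a mini-batch is usually used instead: at each step pick a subset of the data (for example $N/10$ examples), compute the gradient for each in parallel, and update the weights $w$ with the average. This usually works better in practice.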
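The four steps of the algorithm, together with the mini-batch variant, can be sketched as a short training loop. Everything specific here is an assumption for illustration: the toy target $y = x^2$, the 1-4-1 network, the learning rate `eta`, the step count, the seeds, and sampling with replacement. With `batch_size=1` the loop is the plain SGD of the algorithm above.

```python
import math
import random

def init_weights(sizes, rng, scale=0.1):
    """Small random weights; W[l][i][j] stands for w_ij^(l+1), row i = 0 is the bias."""
    return [[[rng.uniform(-scale, scale) for _ in range(sizes[l + 1])]
             for _ in range(sizes[l] + 1)] for l in range(len(sizes) - 1)]

def forward(weights, x):
    outputs = [list(x)]
    for l, W in enumerate(weights, start=1):
        prev = [1.0] + outputs[-1]
        scores = [sum(W[i][j] * prev[i] for i in range(len(prev)))
                  for j in range(len(W[0]))]
        outputs.append(scores if l == len(weights) else [math.tanh(s) for s in scores])
    return outputs

def backprop(weights, x, y):
    """Gradient of (y - NNet(x))^2 for one example, in the same layout as weights."""
    outputs = forward(weights, x)                       # step 2: forward
    L = len(weights)
    delta = [-2.0 * (y - outputs[L][0])]                # step 3: backward
    grads = [None] * L
    for l in range(L, 0, -1):
        prev = [1.0] + outputs[l - 1]
        grads[l - 1] = [[delta[j] * prev[i] for j in range(len(delta))]
                        for i in range(len(prev))]
        if l > 1:
            W = weights[l - 1]
            delta = [sum(delta[k] * W[j + 1][k] for k in range(len(delta)))
                     * (1.0 - outputs[l - 1][j] ** 2)
                     for j in range(len(outputs[l - 1]))]
    return grads

def e_in(weights, data):
    return sum((y - forward(weights, x)[-1][0]) ** 2 for x, y in data) / len(data)

def train(weights, data, eta=0.02, steps=3000, batch_size=1, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        batch = [rng.choice(data) for _ in range(batch_size)]   # step 1: stochastic
        all_grads = [backprop(weights, x, y) for x, y in batch]
        for l, W in enumerate(weights):                         # step 4: descent
            for i in range(len(W)):
                for j in range(len(W[0])):
                    avg = sum(g[l][i][j] for g in all_grads) / batch_size
                    W[i][j] -= eta * avg
    return weights

# Toy regression data, made up for this sketch: y = x^2 on a grid in [-1, 1].
data = [([k / 10.0], (k / 10.0) ** 2) for k in range(-10, 11)]

sgd_weights = init_weights([1, 4, 1], random.Random(1))
before = e_in(sgd_weights, data)
train(sgd_weights, data, batch_size=1)
after_sgd = e_in(sgd_weights, data)

batch_weights = init_weights([1, 4, 1], random.Random(1))
train(batch_weights, data, batch_size=5)
after_batch = e_in(batch_weights, data)

print("E_in before:", before, "after SGD:", after_sgd, "after mini-batch:", after_batch)
```

All gradients in a batch are computed from the same weights before any update is applied, which is what makes the averaged step a mini-batch gradient step.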


Optimization and Regularization 


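From the analysis above, the goal of neural network optimization is to minimize $E_{in}(w)$. This lecture uses the squared error as the error measure. Other error measures can be used as well, with only small changes to the derivation, which are not repeated here.

$$E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \mathrm{err}\left(\left(\cdots \tanh\left(\sum_j w_{jk}^{(2)} \cdot \tanh\left(\sum_i w_{ij}^{(1)} x_{n,i}\right)\right)\right), y_n\right)$$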


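We now look at the optimization problem itself. A neural network consists of an input layer, several hidden layers, and an output layer, and is a fairly complex nonlinear model. As a result $E_{in}(w)$ is generally non-convex and may have many local minima, which makes the global minimum hard to reach. GD or SGD with backprop will most likely end at a local minimum.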

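Because of this, different initial weights $w_{ij}^{(\ell)}$ usually lead to different local minima, so the final $G$ depends strongly on the initialization. A useful rule of thumb is to choose small initial weights, and to choose them at random. Small, because with large weights the scores land in the flat tails of tanh (saturation), where the gradient is tiny, so each iteration changes the weights only slightly and progress is slow. Random, because we usually have no prior knowledge of good weight values. Random initialization avoids hand-picked bias, and trying several random starting points gives a better chance of reaching a good minimum.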


• generally non-convex when multiple hidden layers
    • not easy to reach global minimum
    • GD/SGD with backprop only gives local minimum
• different initial $w_{ij}^{(\ell)}$ $\Longrightarrow$ different local minimum
    • somewhat 'sensitive' to initial weights
    • large weights $\Longrightarrow$ saturate (small gradient)
    • advice: try some random & small ones
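The saturation effect follows from the derivative $\tanh'(s) = 1 - \tanh^2(s)$, which multiplies every backpropagated $\delta$. The sketch below simply evaluates that derivative at a few scores; the sample values are chosen for illustration.

```python
import math

def tanh_slope(s):
    """Derivative of tanh at s: 1 - tanh(s)^2."""
    return 1.0 - math.tanh(s) ** 2

# Small scores (small weights) keep the slope near 1.
# Large scores (large weights) push tanh into its flat tails.
for s in (0.1, 1.0, 3.0, 10.0):
    print("s =", s, " tanh'(s) =", tanh_slope(s))
```

With a score of 10 the slope is many orders of magnitude below 1, so the gradient reaching the weights behind that node almost vanishes.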


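Now consider the VC dimension of the neural network model. With tanh-like transfer functions, the complexity of the whole model is roughly $d_{VC} = O(VD)$, where $V$ is the number of neurons (bias nodes excluded) and $D$ is the number of weights. When $V$ is large enough, the VC dimension is also very large, so the network can fit very complex models, but it may also overfit. The number of neurons $V$ should therefore not be too large.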


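A common way to prevent a neural network from overfitting is regularization. As introduced earlier, we can add a regularizer to the error function, for example the familiar L2 regularizer $\Omega(w)$:

$$\Omega(w) = \sum \left(w_{ij}^{(\ell)}\right)^2$$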


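The L2 regularizer has one drawback, however: it shrinks every weight proportionally. Large weights shrink by a large amount and small weights by a small amount, so proportional shrinking rarely drives a weight all the way to zero. Yet we would like some weights to satisfy $w_{ij}^{(\ell)} = 0$, that is, a sparse solution, because sparsity effectively lowers the VC dimension, reduces model complexity, and helps prevent overfitting.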


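How can we obtain a sparse solution? One option introduced earlier is the L1 regularizer $\sum \left|w_{ij}^{(\ell)}\right|$, but the absolute value makes it non-differentiable at zero and awkward to optimize with gradients. Another common choice is the weight-elimination regularizer. It resembles the L2 regularizer but rescales it, so that large weights and small weights are shrunk by a comparable amount, which pushes more of the small weights to zero. The weight-elimination regularizer is:

$$\sum \frac{\left(w_{ij}^{(\ell)}\right)^2}{1 + \left(w_{ij}^{(\ell)}\right)^2}$$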


• 'shrink' weights: large weight $\to$ large shrink; small weight $\to$ small shrink
• want $w_{ij}^{(\ell)} = 0$ (sparse) to effectively decrease $d_{VC}$
    • L1 regularizer: $\sum \left|w_{ij}^{(\ell)}\right|$, but not differentiable
    • weight-elimination ('scaled' L2) regularizer: large weight $\to$ median shrink; small weight $\to$ median shrink
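How strongly each regularizer pulls on a weight can be read from its derivative. For L2 the derivative of $w^2$ is $2w$. For weight-elimination, the quotient rule applied to $w^2 / (1 + w^2)$ gives $2w / (1 + w^2)^2$. The sketch below tabulates both for a single weight; the function names and sample weights are chosen for this example.

```python
def l2_grad(w):
    """Derivative of w^2."""
    return 2.0 * w

def weight_elim(w):
    """Weight-elimination penalty for a single weight: w^2 / (1 + w^2)."""
    return w * w / (1.0 + w * w)

def weight_elim_grad(w):
    """Derivative of w^2 / (1 + w^2), by the quotient rule."""
    return 2.0 * w / (1.0 + w * w) ** 2

for w in (0.01, 0.1, 1.0, 10.0):
    print("w =", w, " L2 pull:", l2_grad(w), " weight-elimination pull:", weight_elim_grad(w))
```

For small weights the two derivatives nearly coincide, while for large weights the weight-elimination derivative stays bounded instead of growing with $w$ as the L2 derivative does.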


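Besides the weight-elimination regularizer, another very effective form of regularization is Early Stopping. Put simply, the number of training iterations $t$ should not be too large. A large $t$ lets the optimizer visit more weight combinations, which makes the effective model more complex, raises the effective VC dimension, and may cause overfitting. A moderate $t$ effectively lowers the VC dimension and the model complexity, and so acts as regularization. The relation of $E_{in}$ and $E_{test}$ to the number of iterations $t$ is shown in the lower-right plot of the figure below: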


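• GD/SGD (backprop) visits more weight combinations as $t$ increases
• smaller $t$ effectively decrease $d_{VC}$
• better 'stop in middle': early stopping

[Figure: two plots. The first shows error curves including model complexity against the VC dimension $d_{VC}$, with the best $d_{VC}^*$ in the middle (remember? :-)). The second shows $\log_{10}(\text{error})$ against the iteration $t$.]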


How, then, do we choose the best number of iterations $t$? We can use validation to make the choice.
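Choosing $t$ by validation can be sketched as a loop that remembers the best model seen so far. The helper below is hypothetical, not from the lecture: `step` performs one training update, `val_error` returns the validation error, and `patience` (how many non-improving steps to tolerate) is an added assumption. The usage example replaces a real network with a synthetic U-shaped validation curve, purely to show the control flow.

```python
import copy

def train_with_early_stopping(model, step, val_error, max_steps, patience=10):
    """Run `step` repeatedly and return the model with the lowest validation error."""
    best_err = val_error(model)
    best_model = copy.deepcopy(model)
    best_t = 0
    for t in range(1, max_steps + 1):
        step(model)
        err = val_error(model)
        if err < best_err:
            best_err, best_t = err, t
            best_model = copy.deepcopy(model)      # snapshot the best weights
        elif t - best_t >= patience:
            break                                  # no improvement for `patience` steps
    return best_model, best_t, best_err

# Synthetic stand-in for a real training run: the "model" only counts its updates,
# and the validation error is a made-up U-shaped curve with its minimum at t = 30.
toy_model = {"t": 0}

def toy_step(model):
    model["t"] += 1

def toy_val_error(model):
    return 0.1 + (model["t"] - 30) ** 2 / 900.0

best_model, best_t, best_err = train_with_early_stopping(
    toy_model, toy_step, toy_val_error, max_steps=1000)
print("stopped training at t =", toy_model["t"], "and kept the model from t =", best_t)
```

In a real run, `step` would be one SGD or mini-batch update and `val_error` would evaluate the network on held-out data.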


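This lecture introduced the Neural Network model. First, we saw that one or more layers of perceptrons yield increasingly complex nonlinear models. We then defined the Neural Network Hypothesis, in which each neuron acts as a pattern detector, so training amounts to pattern extraction at every layer: finding the weights $w_{ij}^{(\ell)}$ that best fit the data and produce the final $G$. The lecture used regression as the example, so the output is a linear model, while all hidden layers use tanh as the transformation function. The weights $w_{ij}^{(\ell)}$ are computed with GD or SGD, using the Backpropagation algorithm to obtain the gradients and update the weights until $E_{in}(w)$ is small, which completes the training of the network. Finally, we discussed ways to keep the optimization well behaved and to prevent overfitting: starting from small random initial weights, and regularizing with the weight-elimination regularizer or early stopping.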


Note: all figures in this article come from Hsuan-Tien Lin's Machine Learning Techniques course at National Taiwan University.


